Abstract

The Support Vector (SV) machine is a novel type of learning machine, based on statistical learning theory, 
which contains polynomial classifiers, neural networks, and radial basis function (RBF) networks as special 
cases. In the RBF case, the SV algorithm automatically determines centers, weights and threshold such 
as to minimize an upper bound on the expected test error. 

The present study is devoted to an experimental comparison of these machines with a classical approach, 
where the centers are determined by k-means clustering and the weights are found using error backprop-
agation. We consider three machines, namely a classical RBF machine, an SV machine with Gaussian
kernel, and a hybrid system with the centers determined by the SV method and the weights trained by 
error backpropagation. Our results show that on the US postal service database of handwritten digits, 
the SV machine achieves the highest test accuracy, followed by the hybrid approach. The SV approach is 
thus not only theoretically well-founded, but also superior in a practical application. 
Figure 1: A simple 2-dimensional classification problem: find a decision function separating balls from circles. The box, as in all following pictures, depicts the region $[-1, 1]^2$.



1 Introduction 

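Consider Fig. 1. Suppose we want to construct a radial basis function classifier

$$D(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{\ell} w_i \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{c_i} \right) + b \right) \qquad (1)$$

($b$ and $c_i$ being constants, the latter positive) separating balls from circles, i.e. taking different values on balls and circles. How do we choose the centers $\mathbf{x}_i$? Two extreme cases are conceivable: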

The first approach consists in choosing the centers for 
the two classes separately, irrespective of the classifica- 
tion task to be solved. The classical technique of finding 
the centers by some clustering technique (before tackling 
the classification problem) is such an approach. The 
weights $w_i$ are then usually found by either error back-
propagation (Rumelhart, Hinton, & Williams, 1986) or
the pseudo-inverse method (e.g. Poggio & Girosi, 1990). 

An alternative approach (Fig. 2) consists in choosing 
as centers points which are critical for the classification 
task at hand. Recently, the Support Vector Algorithm 
was developed (Boser, Guyon & Vapnik 1992, Cortes & 
Vapnik 1995, Vapnik 1995) which implements the lat- 
ter idea. It is a general algorithm, based on guaranteed 
risk bounds of statistical learning theory, which in par- 
ticular allows the construction of radial basis function 
classifiers. This is done by simply choosing a suitable 
kernel function for the SV machine (see Sec. 2.2). The 
SV training consists of a quadratic programming prob- 
lem which can be solved efficiently and for which we are 
guaranteed to find a global extremum. The algorithm automatically computes the number and location of the above centers, the weights $w_i$, and the threshold $b$, in the following way: by the use of a suitable kernel function (in the present case, a Gaussian one), the patterns are mapped nonlinearly into a high-dimensional space. There, an optimal separating hyperplane is constructed, expressed in terms of those examples which are closest to the decision boundary (Vapnik 1979). These are the Support Vectors which correspond to the centers in input space.

Figure 2: RBF centers automatically found by the Support Vector algorithm (indicated by extra circles), using $c_i = 1$ for all $i$ (cf. Eq. 1). The number of SV centers accidentally coincides with the number of identifiable clusters (indicated by crosses found by k-means clustering with k = 2 and k = 3 for balls and circles, respectively) but the naive correspondence between clusters and centers is lost; indeed, 3 of the SV centers are circles, and only 2 of them are balls. Note that the SV centers are chosen with respect to the classification task to be solved.

The goal of the present study is to compare real-world 
results obtained with k-means clustering and classical
RBF training to those obtained with the centers, weights 
and threshold automatically chosen by the Support Vec- 
tor algorithm. To this end, we decided to undertake a 
performance study combining expertise on the Support 
Vector algorithm (AT&T Bell Laboratories) and classi- 
cal radial basis function networks (Massachusetts Insti- 
tute of Technology). We report results obtained on a US 
postal service database of handwritten digits. 

We have organized the material as follows. In the 
next Section, we describe the algorithms used to train 
the different types of RBF classifiers used in this paper. 
Following that, we present an experimental comparison 
of the approaches. We conclude with a discussion of our 
findings. 

2 Different Ways of Constructing a 
Radial Basis Function Classifier 

We describe three radial basis function systems, trained 
in different ways. In Sec. 2.1, we discuss the first sys- 
tem trained along more classical lines. In the follow- 
ing section (2.2), we discuss the Support Vector algo- 
rithm, which constructs an RBF network whose param- 
eters (centers, weights, threshold) are automatically op- 
timized. In Sec. 2.3, finally, we use the Support Vector 
algorithm merely to choose the centers of the RBF net- 
work and then optimize the weights separately. 



2.1 Classical Spherical Gaussian RBFs: 

We begin by describing the classical Gaussian RBF system. A $d$-dimensional spherical Gaussian RBF network with $K$ centers has the mathematical form

$$g(\mathbf{x}) = \sum_{i=1}^{K} w_i G_i(\mathbf{x}) + b,$$

where $G_i$ is the $i$-th Gaussian basis function with center $\mathbf{c}_i$ and variance $\sigma_i^2$. The weight coefficients $w_i$ combine
the Gaussian terms into a single output value and b is
a bias term. In general, building a Gaussian RBF net- 
work for a given learning task involves (1) determining 
the total number of Gaussian basis functions to use for 
each output class and for the entire system, (2) locating 
the Gaussian basis function centers, (3) computing the 
cluster variance for each Gaussian basis function, and (4) 
solving for the weight coefficients and bias in the summa- 
tion term. One can implement a binary pattern classifier 
on input vectors x as a Gaussian RBF network by defin- 
ing an appropriate output threshold that separates the 
two pattern classes. 

In this first system, we implement each individual 
digit recognizer as a spherical Gaussian RBF network, 
trained with a classical RBF algorithm. Given a spec- 
ified number of Gaussian basis functions for each digit 
class, the algorithm separately computes the Gaussian 
centers and variances for each of the 10 digit classes 
to form the system's RBF kernels. The algorithm then 
solves for an optimal set of weight parameters between 
the RBF kernels and each output node to perform the 
desired digit recognition task. The training process con- 
structs all 10 digit recognizers in parallel so one can re- 
use the same Gaussian basis functions among the 10 digit 
recognizers. To avoid overfitting the available training 
data with an overly complex RBF classifier connected to 
every Gaussian kernel, we use a "bootstrap" like oper- 
ation that selectively connects each recognizer's output 
node to only a "relevant" subset of all basis functions. 
The idea is similar to how we choose relevant "near-miss" 
clusters for each individual digit recognizer in the origi- 
nal system. The training procedure proceeds as follows 
(for further details, see Sung, 1996): 

1. The first training task is to determine an appro- 
priate number k of Gaussian kernels for each digit 
class. This information is needed to initialize our 
clustering procedure for computing Gaussian RBF 
kernels. We opted for using the same numbers of 
Gaussian kernels as the ones automatically com- 
puted by the SV algorithm (see Table 1). 

2. Our next task is to actually compute the Gaussian
kernels for each digit class. We do this by sepa-
rately performing classical k-means clustering (see
e.g. Lloyd, 1982) on each digit class in the US postal
service (USPS) training database. Each clustering
operation returns a set of Gaussian centroids and
their respective variances for the given digit class.
Together, the Gaussian clusters from all 10 digit
classes form the system's RBF kernels.

3. For each single-digit recognizer, we build an initial 
RBF network using only Gaussian kernels from its 
target class, using error backpropagation to train 
the weights. We then separately collect all the false 
positive mistakes each initial digit recognizer makes 
on the USPS training database. 

4. In the final training step, we augment each initial 
digit recognizer with additional Gaussian kernels 
from outside its target class to help reduce mis- 
classification errors. We determine which Gaus- 
sian kernels are "relevant" for each recognizer as 
follows: For each false positive mistake the initial 
recognizer makes during the previous step, we look 
up the misclassified pattern's actual digit class and 
include the nearest Gaussian kernel from its class in 
the "relevant" set. The final RBF network for each 
single-digit recognizer thus contains every Gaussian 
kernel from its target class, and several "relevant" 
kernels from the other 9 digit classes, trained by 
error backpropagation. Because our final digit rec- 
ognizers have fewer weight parameters than a naive 
system that fully connects all 10 recognizers to ev- 
ery Gaussian kernel, we expect our system to gen- 
eralize better on new data. 
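
The clustering in step 2 can be sketched as follows. This is a minimal illustration of Lloyd's k-means run separately per class, not the implementation used in the experiments. The function names, the random initialization from training points, and the definition of a cluster's variance as the mean squared distance of its members to the centroid are assumptions made for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        dists = [sq_dist(p, c) for c in centroids]
        clusters[dists.index(min(dists))].append(p)
    return clusters

def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's k-means on a list of tuples; returns (centroids, variances)."""
    rng = random.Random(seed)
    # Assumption: initialize the centroids with k distinct training points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = assign(points, centroids)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so that the variances match the returned centroids.
    clusters = assign(points, centroids)
    variances = [sum(sq_dist(p, c) for p in m) / len(m) if m else 0.0
                 for c, m in zip(centroids, clusters)]
    return centroids, variances

def kernels_per_class(data_by_class, k_by_class):
    """Cluster each class separately, as in step 2 of the procedure."""
    return {label: kmeans(points, k_by_class[label])
            for label, points in data_by_class.items()}
```

Calling `kernels_per_class` with the per-class kernel counts (in the study, the numbers of positive Support Vectors) would yield one set of centroids and variances per digit class. Like any k-means run, the result can depend on the initialization.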

2.2 The Support Vector Machine 

Structural Risk Minimization. For the case of two-class pattern recognition, the task of learning from examples can be formulated in the following way: given a set of functions

$$\{f_\alpha : \alpha \in \Lambda\}, \qquad f_\alpha : \mathbb{R}^N \to \{-1, +1\}$$

and a set of examples

$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell), \qquad \mathbf{x}_i \in \mathbb{R}^N, \; y_i \in \{-1, +1\},$$

each one generated from an unknown probability distribution $P(\mathbf{x}, y)$, we want to find a function $f_{\alpha^*}$ which provides the smallest possible value for the risk

$$R(\alpha) = \int |f_\alpha(\mathbf{x}) - y| \, dP(\mathbf{x}, y).$$

The problem is that $R(\alpha)$ is unknown, since $P(\mathbf{x}, y)$ is unknown. Therefore an induction principle for risk minimization is necessary.

The straightforward approach to minimize the empirical risk

$$R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} |f_\alpha(\mathbf{x}_i) - y_i|$$

turns out not to guarantee a small actual risk (i.e. a small error on the training set does not imply a small error on a test set), if the number $\ell$ of training examples is limited. To make the most out of a limited amount of data, novel statistical techniques have been developed during the last 25 years. The Structural Risk Minimization principle (Vapnik, 1979) is based on the fact that for the above learning problem, for any $\alpha \in \Lambda$ with a probability of at least $1 - \eta$, the bound

$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \Phi\left( \frac{h}{\ell}, \frac{\log(\eta)}{\ell} \right) \qquad (2)$$

holds, $\Phi$ being defined as

$$\Phi\left( \frac{h}{\ell}, \frac{\log(\eta)}{\ell} \right) = \sqrt{ \frac{h \left( \log\frac{2\ell}{h} + 1 \right) - \log(\eta/4)}{\ell} }.$$

The parameter $h$ is called the VC-dimension of a set of functions. It describes the capacity of a set of functions implementable by the learning machine. For binary classification, $h$ is the maximal number of points $k$ which can be separated into two classes in all possible $2^k$ ways by using functions of the learning machine; i.e. for each possible separation there exists a function which takes the value 1 on one class and -1 on the other class.
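
To get a feeling for how the bound (2) behaves, the confidence term can be evaluated numerically. The sketch below assumes that $\Phi$ has the form $\sqrt{(h(\log(2\ell/h) + 1) - \log(\eta/4))/\ell}$ with natural logarithms. The function and parameter names are chosen for this illustration only.

```python
import math

def vc_confidence(h, n_examples, eta):
    """Confidence term Phi of bound (2).

    h          -- VC-dimension of the set of functions
    n_examples -- number of training examples (ell in the text)
    eta        -- the bound holds with probability at least 1 - eta
    Assumption: natural logarithms.
    """
    return math.sqrt((h * (math.log(2 * n_examples / h) + 1)
                      - math.log(eta / 4)) / n_examples)

def guaranteed_risk(empirical_risk, h, n_examples, eta):
    """Right-hand side of bound (2): empirical risk plus confidence term."""
    return empirical_risk + vc_confidence(h, n_examples, eta)
```

For a fixed number of examples the confidence term grows with $h$, and for a fixed $h$ it shrinks as more examples become available. This is the trade-off that the nested structure introduced next is meant to control.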

According to (2), given a fixed number $\ell$ of training examples one can control the risk by controlling two quantities: $R_{\mathrm{emp}}(\alpha)$ and $h(\{f_\alpha : \alpha \in \Lambda'\})$, $\Lambda'$ denoting some subset of the index set $\Lambda$. The empirical risk depends on the function chosen by the learning machine (i.e. on $\alpha$), and it can be controlled by picking the right $\alpha$. The VC-dimension $h$ depends on the set of functions $\{f_\alpha : \alpha \in \Lambda'\}$ which the learning machine can implement. To control $h$, one introduces a structure of nested subsets $S_n := \{f_\alpha : \alpha \in \Lambda_n\}$ of $\{f_\alpha : \alpha \in \Lambda\}$,

$$S_1 \subset S_2 \subset \ldots \subset S_n \subset \ldots, \qquad (3)$$

with the corresponding VC-dimensions satisfying

$$h_1 \le h_2 \le \ldots \le h_n \le \ldots$$

For a given set of observations $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell)$ the Structural Risk Minimization principle chooses the function $f_{\alpha_n}$ in the subset $\{f_\alpha : \alpha \in \Lambda_n\}$ for which the guaranteed risk bound (the right hand side of (2)) is minimal.

The remainder of this section follows Schölkopf,
Burges & Vapnik (1995) in briefly reviewing the Sup- 
port Vector algorithm. For details, the reader is referred 
to (Vapnik, 1995). 

A Structure on the Set of Hyperplanes. Each particular choice of a structure (3) gives rise to a learning algorithm. The Support Vector algorithm is based on a structure on the set of hyperplanes. To describe it, first note that given a dot product space $Z$ and a set of vectors $\mathbf{x}_1, \ldots, \mathbf{x}_r \in Z$, each hyperplane $\{\mathbf{x} \in Z : (\mathbf{w} \cdot \mathbf{x}) + b = 0\}$ corresponds to a canonical pair $(\mathbf{w}, b) \in Z \times \mathbb{R}$ if we additionally require

$$\min_{i = 1, \ldots, r} |(\mathbf{w} \cdot \mathbf{x}_i) + b| = 1. \qquad (4)$$

Let $B_{\mathbf{x}_1, \ldots, \mathbf{x}_r} = \{\mathbf{x} \in Z : \|\mathbf{x} - \mathbf{a}\| < R\}$ ($\mathbf{a} \in Z$) be the smallest ball containing the points $\mathbf{x}_1, \ldots, \mathbf{x}_r$, and

$$f_{\mathbf{w}, b} = \operatorname{sgn}((\mathbf{w} \cdot \mathbf{x}) + b) \qquad (5)$$

the decision function defined on these points. The possibility of introducing a structure on the set of hyperplanes is based on the result (Vapnik, 1995) that the set $\{f_{\mathbf{w}, b} : \|\mathbf{w}\| \le A\}$ has a VC-dimension $h$ satisfying

$$h \le R^2 A^2. \qquad (6)$$



Note. Dropping the condition $\|\mathbf{w}\| \le A$ leads to a set of functions whose VC-dimension equals $N + 1$, where $N$ is the dimensionality of $Z$. Due to $\|\mathbf{w}\| \le A$, we can get VC-dimensions which are much smaller than $N$, enabling us to work in very high dimensional spaces.
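
As a brief worked step connecting the canonical form (4) to the norm constraint (a standard geometric fact, added here as a reading aid): the Euclidean distance from a point $\mathbf{x}_i$ to the hyperplane is $|(\mathbf{w} \cdot \mathbf{x}_i) + b| / \|\mathbf{w}\|$, so under (4) the closest of the points lies at distance

$$\min_{i = 1, \ldots, r} \frac{|(\mathbf{w} \cdot \mathbf{x}_i) + b|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|} \ge \frac{1}{A} \qquad \text{whenever } \|\mathbf{w}\| \le A.$$

Bounding the norm by $A$ therefore restricts attention to hyperplanes that keep a distance of at least $1/A$ from every point. The quantity $R^2 A^2 = (R / (1/A))^2$ in (6) compares the radius of the data to this distance and does not involve the dimensionality $N$.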

The Support Vector Algorithm. Now suppose we want to find a decision function $f_{\mathbf{w}, b}$ with the property $f_{\mathbf{w}, b}(\mathbf{x}_i) = y_i$, $i = 1, \ldots, \ell$. If this function exists, canonicality (4) implies

$$y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1, \qquad i = 1, \ldots, \ell. \qquad (7)$$

In many practical applications, a separating hyperplane does not exist. To allow for the possibility of examples violating (7), Cortes & Vapnik (1995) introduce slack variables

$$\xi_i \ge 0, \qquad i = 1, \ldots, \ell, \qquad (8)$$

to get

$$y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1 - \xi_i, \qquad i = 1, \ldots, \ell. \qquad (9)$$

The Support Vector approach to minimizing the guaranteed risk bound (2) consists in the following: minimize

$$\Phi(\mathbf{w}, \boldsymbol{\xi}) = (\mathbf{w} \cdot \mathbf{w}) + \gamma \sum_{i=1}^{\ell} \xi_i \qquad (10)$$

subject to the constraints (8) and (9). According to (6), minimizing the first term amounts to minimizing the VC-dimension of the learning machine, thereby minimizing the second term of the bound (2). The term $\sum_{i=1}^{\ell} \xi_i$, on the other hand, is an upper bound on the number of misclassifications on the training set; this controls the empirical risk term in (2). For a suitable positive constant $\gamma$, this approach therefore constitutes a practical implementation of Structural Risk Minimization on the given set of functions.
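
The role of the slack variables can be made concrete with a small sketch. For a fixed hyperplane it computes the smallest slack values compatible with the constraints $\xi_i \ge 0$ and (9), and then the objective (10) as written above. The function names and the toy data in the usage note are assumptions made for this illustration.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slack_variables(w, b, X, y):
    """Smallest xi_i >= 0 satisfying y_i((w . x_i) + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def primal_objective(w, b, X, y, gamma):
    """Objective (10): (w . w) + gamma * sum_i xi_i."""
    return dot(w, w) + gamma * sum(slack_variables(w, b, X, y))

def training_errors(w, b, X, y):
    """Number of training examples on the wrong side of the hyperplane."""
    return sum(1 for xi, yi in zip(X, y) if yi * (dot(w, xi) + b) < 0)
```

For example, with the one-dimensional points 2, -2 and 0.5 labelled +1, -1 and -1, and the hyperplane w = 1, b = 0, the slack values are 0, 0 and 1.5. The single misclassified point has slack above 1, so the sum of the slacks (1.5) indeed bounds the number of training errors (1) from above.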

Introducing Lagrange multipliers $\alpha_i$ and using the Kuhn-Tucker theorem of optimization theory, the solution can be shown to have an expansion

$$\mathbf{w} = \sum_{i=1}^{\ell} y_i \alpha_i \mathbf{x}_i, \qquad (11)$$

with nonzero coefficients $\alpha_i$ only for the cases where the corresponding example $(\mathbf{x}_i, y_i)$ precisely meets the constraint (9). These $\mathbf{x}_i$ are called Support Vectors. All the remaining examples $\mathbf{x}_i$ of the training set are irrelevant: their constraint (9) is satisfied automatically (with $\xi_i = 0$), and they do not appear in the expansion (11). The coefficients $\alpha_i$ are found by solving the following quadratic programming problem: maximize

$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j) \qquad (12)$$

subject to

$$0 \le \alpha_i \le \gamma, \quad i = 1, \ldots, \ell, \qquad \text{and} \qquad \sum_{i=1}^{\ell} \alpha_i y_i = 0. \qquad (13)$$
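
The quadratic program can be checked by hand on a tiny example. The sketch below evaluates the dual objective $W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j)$, tests the box and equality constraints, and recovers $\mathbf{w} = \sum_i y_i \alpha_i \mathbf{x}_i$. It is not a QP solver. The function names and the two-point toy problem are assumptions made for this illustration.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """W(alpha) of Eq. (12) for the linear (dot product) case."""
    n = len(alpha)
    quad = sum(y[i] * y[j] * alpha[i] * alpha[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def is_feasible(alpha, y, gamma, tol=1e-9):
    """Constraints (13): 0 <= alpha_i <= gamma and sum_i alpha_i y_i = 0."""
    in_box = all(-tol <= a <= gamma + tol for a in alpha)
    return in_box and abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol

def weight_vector(alpha, X, y):
    """Expansion (11): w = sum_i y_i alpha_i x_i."""
    return [sum(y[i] * alpha[i] * X[i][d] for i in range(len(X)))
            for d in range(len(X[0]))]

# Toy problem: x_1 = +1 (class +1) and x_2 = -1 (class -1) on the real line.
X = [(1.0,), (-1.0,)]
y = [1, -1]
# The equality constraint forces alpha_1 = alpha_2 = a, which gives
# W = 2a - 2a^2; this is maximal at a = 1/2 (feasible if gamma >= 1/2).
alpha = [0.5, 0.5]
```

At this maximizer the expansion gives w = 1, so both points satisfy the canonical constraint with equality and both are Support Vectors.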




Figure 3: A simple two-class classification problem as solved by the Support Vector algorithm ($c_i = 1$ for all $i$; cf. Eq. 1). Note that the RBF centers (indicated by extra circles) are closest to the decision boundary.



By linearity of the dot product, the decision function (5) can thus be written as

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{\ell} y_i \alpha_i \cdot (\mathbf{x} \cdot \mathbf{x}_i) + b \right).$$

So far, we have described linear decision surfaces. To allow for much more general decision surfaces, one can first nonlinearly transform the input vectors into a high-dimensional feature space by a map $\phi$ and then do a linear separation there. Maximizing (12) then requires the computation of dot products $(\phi(\mathbf{x}) \cdot \phi(\mathbf{x}_i))$ in a high-dimensional space. In some cases, these expensive calculations can be reduced significantly by using a suitable function $K$ such that

$$(\phi(\mathbf{x}) \cdot \phi(\mathbf{x}_i)) = K(\mathbf{x}, \mathbf{x}_i).$$

We thus get decision functions of the form

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{\ell} y_i \alpha_i \cdot K(\mathbf{x}, \mathbf{x}_i) + b \right). \qquad (14)$$



Figure 4: Two-class classification problem solved by the Support Vector algorithm ($c_i = 1$ for all $i$; cf. Eq. 1).

In practice, we need not worry about conceiving the map $\phi$. We will choose a $K$ which is the kernel of a positive Hilbert-Schmidt operator, and Mercer's theorem of functional analysis then tells us that $K$ corresponds to a dot product in some other space (see Boser, Guyon & Vapnik, 1992). Consequently, everything that has been said above about the linear case also applies to nonlinear cases obtained by using a suitable kernel $K$ instead of the Euclidean dot product. We are now in a position to explain how the Support Vector algorithm can construct radial basis function classifiers: we simply use

$$K(\mathbf{x}, \mathbf{x}_i) = \exp\left( -\|\mathbf{x} - \mathbf{x}_i\|^2 / c \right) \qquad (15)$$

(see Aizerman, Braverman & Rozonoer, 1964). Other possible choices of $K$ include

$$K(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x} \cdot \mathbf{x}_i)^d,$$

yielding polynomial classifiers ($d \in \mathbb{N}$), and

$$K(\mathbf{x}, \mathbf{x}_i) = \tanh(\kappa \cdot (\mathbf{x} \cdot \mathbf{x}_i) + \Theta)$$

for constructing neural networks.
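
The three kernels and the kernel decision function can be sketched in a few lines. The code below is an illustration only: the function names are invented for the example, the coefficients in the usage note are hand-picked rather than trained, and the convention of mapping an output of exactly zero to the class +1 is an assumption.

```python
import math

def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def gaussian_kernel(c):
    """Gaussian kernel: K(x, x_i) = exp(-||x - x_i||^2 / c)."""
    return lambda u, v: math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / c)

def polynomial_kernel(d):
    """Polynomial kernel of degree d: K(x, x_i) = (x . x_i)^d."""
    return lambda u, v: dot(u, v) ** d

def sigmoid_kernel(kappa, theta):
    """Neural network kernel: K(x, x_i) = tanh(kappa * (x . x_i) + theta)."""
    return lambda u, v: math.tanh(kappa * dot(u, v) + theta)

def decision_value(x, support_vectors, labels, alphas, b, kernel):
    """Argument of sgn in the decision function: sum_i y_i alpha_i K(x, x_i) + b."""
    return sum(yi * ai * kernel(x, sv)
               for sv, yi, ai in zip(support_vectors, labels, alphas)) + b

def classify(x, support_vectors, labels, alphas, b, kernel):
    """Class label in {-1, +1}; a value of exactly 0 is mapped to +1 (assumption)."""
    value = decision_value(x, support_vectors, labels, alphas, b, kernel)
    return 1 if value >= 0 else -1
```

With two hypothetical Support Vectors at (1, 0) and (-1, 0), labels +1 and -1, coefficients of 1 each, b = 0 and `gaussian_kernel(1.0)`, a test point such as (0.8, 0.1) is assigned to the class of the nearer center. Only the kernel argument changes when switching between the three machine types.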

Interestingly, these different types of SV machines use largely the same Support Vectors; i.e. most of the centers of an SV machine with Gaussian kernel coincide with the weights of the polynomial and neural network SV classifiers (Schölkopf, Burges & Vapnik 1995).

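To find the decision function (14), we have to maximize

$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \qquad (16)$$

under the constraint (13). To find the threshold $b$, one takes into account that due to (9), for Support Vectors $\mathbf{x}_j$ for which $\xi_j = 0$ we have

$$\sum_{i=1}^{\ell} y_i \alpha_i \cdot K(\mathbf{x}_j, \mathbf{x}_i) + b = y_j.$$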
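
Solving the last condition for the threshold gives $b = y_j - \sum_i y_i \alpha_i K(\mathbf{x}_j, \mathbf{x}_i)$ for any Support Vector $\mathbf{x}_j$ with zero slack. A minimal sketch follows, with function names and a linear-kernel toy problem that are assumptions made for this example.

```python
def linear_kernel(u, v):
    """Plain dot product, used here as the simplest possible kernel."""
    return sum(a * b for a, b in zip(u, v))

def threshold_from_support_vector(j, X, y, alpha, kernel):
    """Solve sum_i y_i alpha_i K(x_j, x_i) + b = y_j for b.

    The index j must refer to a Support Vector whose slack variable is zero.
    """
    return y[j] - sum(y[i] * alpha[i] * kernel(X[j], X[i])
                      for i in range(len(X)))

# Toy problem on the real line: x_1 = +1 (class +1), x_2 = -1 (class -1).
# One can check by hand that alpha = (1/2, 1/2) maximizes the dual here.
X = [(1.0,), (-1.0,)]
y = [1, -1]
alpha = [0.5, 0.5]
b_from_first = threshold_from_support_vector(0, X, y, alpha, linear_kernel)
b_from_second = threshold_from_support_vector(1, X, y, alpha, linear_kernel)
```

Both Support Vectors yield the same threshold, b = 0, as expected for this symmetric problem. Averaging the value over several such Support Vectors is a common numerical safeguard, though the text does not prescribe it.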



Finally, we note that the Support Vector algorithm has been empirically shown to exhibit good generalization ability (Cortes & Vapnik, 1995). This can be further improved by incorporating invariances of a problem at hand, as with the Virtual Support Vector method of generating artificial examples from the Support Vectors (Schölkopf, Burges, & Vapnik, 1996). In addition, the decision rule (14), which requires the computation of dot products between the test example and all Support Vectors, can be sped up with the reduced set technique (Burges, 1996). These methods have led to substantial improvements for polynomial Support Vector machines (Burges & Schölkopf, 1996), and they are directly applicable also to RBF Support Vector machines.

2.3 A Hybrid System: SV Centers Only 

The previous section discusses how one can train RBF-like networks using the Support Vector algorithm. This involves the choice of an appropriate kernel function $K$ and solving the optimization problem in the form of Eq. (16). The Support Vector algorithm thus automatically determines the centers (which are the Support Vectors), the weights (given by $y_i \alpha_i$), and the threshold $b$ for the RBF machine.

Digit class        0    1    2    3    4    5    6    7    8    9
# of SVs         274  104  377  361  334  388  236  235  342  263
# of pos. SVs    172   77  217  179  211  231  147  133  194  166

Table 1: Numbers of centers (Support Vectors) automatically extracted by the Support Vector machine. The first row gives the total number for each binary classifier, including both positive and negative examples; in the second row, we only counted the positive SVs. The latter number was used in the initialization of the k-means algorithm, cf. Sec. 2.1.

Digit class        0    1    2    3    4    5    6    7    8    9
classical RBF     20   16   43   38   46   31   15   18   37   26
full SVM          16    8   25   19   29   23   14   12   25   16
SV centers only    9   12   27   24   32   24   19   16   26   16

Table 2: Two-class classification: numbers of test errors (out of 2007 test patterns) for the three systems described in Sections 2.1 - 2.3.

To assess the relative influence of the automatic SV 
center choice and the SV weight optimization, respec- 
tively, we built another RBF system, constructed with 
centers that are simply the Support Vectors arising from 
the SV optimization, and with the weights trained sep- 
arately. 
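
The idea of the hybrid system can be sketched as follows: the centers are held fixed and only the output weights and the threshold are fitted. For a single layer of output weights, error backpropagation reduces to gradient descent on the output error, which is what the sketch uses. The squared-error criterion, the learning rate, the number of epochs, the common width c and all function names are assumptions made for this illustration. They do not describe the training setup of the experiments.

```python
import math

def rbf_features(x, centers, c):
    """Gaussian responses exp(-||x - center||^2 / c), one per fixed center."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, ctr)) / c)
            for ctr in centers]

def train_output_weights(X, y, centers, c, lr=0.1, epochs=200):
    """Fit w and b of sum_k w_k K(x, center_k) + b with the centers fixed.

    Batch gradient descent on half the mean squared error (assumption).
    """
    feats = [rbf_features(x, centers, c) for x in X]
    w = [0.0] * len(centers)
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        residuals = [sum(wk * fk for wk, fk in zip(w, f)) + b - yi
                     for f, yi in zip(feats, y)]
        grad_w = [sum(r * f[k] for r, f in zip(residuals, feats)) / n
                  for k in range(len(w))]
        grad_b = sum(residuals) / n
        w = [wk - lr * g for wk, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(x, centers, c, w, b):
    """Class label in {-1, +1} from the sign of the network output."""
    out = sum(wk * fk for wk, fk in zip(w, rbf_features(x, centers, c))) + b
    return 1 if out >= 0 else -1
```

In the hybrid system the `centers` argument would hold the Support Vectors found by the SV optimization, whereas in the classical system it would hold the k-means centroids. The weight training itself is the same in both cases.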

3 Experimental Results 

Toy examples. What are the Support Vectors? They 
are elements of the data set that are "important" in sep- 
arating the two classes from each other. In general, the 
Support Vectors with zero slack variables (see Eq. 8) lie 
on the boundary of the decision surface, as they precisely 
satisfy the inequality (9) in the high-dimensional space. 
Figures 3 and 4 illustrate that for the used Gaussian 
kernel this is also the case in input space. 

This raises an interesting question from the point of 
view of interpreting the structure of trained RBF net- 
works. The traditional view of RBF networks has been 
one where the centers were regarded as "templates" or 
stereotypical patterns. It is this point of view that leads 
to the clustering heuristic for training RBF networks. 
In contrast, the Support Vector machine posits an alter- 
nate point of view, with the centers being those examples 
which are critical for a given classification task. 

US Postal Service Database. We used the USPS database of 9298 handwritten digits (7291 for training, 2007 for testing), collected from mail envelopes in Buffalo (cf. LeCun et al., 1989). Each digit is a 16 x 16 vector with entries between -1 and 1. Preprocessing consisted in smoothing with a Gaussian kernel of width $\sigma = 0.75$. The Support Vector machine results reported in the following were obtained with $\gamma = 10$ (cf. (10)) and $c = 0.3 \cdot 16 \cdot 16$ (cf. (15)).[1] In all experiments, we used the Support Vector algorithm with standard quadratic programming techniques (conjugate gradient descent).

Two-class classification. Table 1 shows the numbers 
of Support Vectors, i.e. RBF centers, extracted by the 
SV algorithm. Table 2 gives the results of binary clas- 
sifiers separating single digits from the rest, for the sys- 
tems described in Sections 2.1, 2.2, and 2.3. 

Ten-class classification. For each test pattern, the arbitration procedure in all three systems simply returns the digit class whose recognizer gives the strongest response.[2] Table 3 shows the 10-class digit recognition error rates for our original system and the two RBF-based systems.
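
The arbitration rule amounts to an argmax over the ten recognizer outputs. A minimal sketch, with a function name and example values invented for the illustration:

```python
def ten_class_decision(outputs):
    """Return the digit whose recognizer responds most strongly.

    `outputs` holds one real-valued recognizer output per digit class; for
    the SV machines this is the value before the sgn function is applied.
    """
    return max(range(len(outputs)), key=lambda digit: outputs[digit])

# Hypothetical outputs of the ten two-class recognizers for one test pattern.
example_outputs = [-1.2, -0.3, 0.4, -2.0, 0.1, -0.7, -1.1, 0.9, -0.5, -0.8]
predicted_digit = ten_class_decision(example_outputs)
```

Because the rule compares real-valued outputs, it still returns a class when no recognizer, or more than one, gives a positive response.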

The fully automatic Support Vector machine exhibits 
the highest test accuracy. Using the Support Vector 
algorithm to choose an appropriate number and corre- 
sponding centers for the RBF network is also better than 
the baseline procedure of choosing the centers by a clus- 
tering heuristic. It can be seen that in contrast to the 
k-means cluster centers, the centers chosen by the Sup-
port Vector algorithm allow zero training error rates.

USPS Database               Classification Error Rate
                            Clustered Centers   S.V. centers   Full S.V.M.
Training (7291 patterns)    1.7%                0.0%           0.0%
Test (2007 patterns)        6.7%                4.9%           4.2%

Table 3: 10-class digit recognition error rates for three RBF classifiers constructed with different algorithms. The first system is a more classical one choosing its centers by a clustering heuristic. The other two are the Gaussian RBF-based systems we trained, one where the Support Vectors were chosen to be the centers and the second where the entire network was trained using the Support Vector algorithm.

[1] The SV machine is rather insensitive to different choices of c: for all values in 0.1, 0.2, ..., 1.0, the performance is about the same (in the area of 4% to 4.5%).

[2] In the Support Vector case, we constructed ten two-class classifiers, each trained to separate a given digit from the other nine, and combined them by doing the ten-class classification according to the maximal output (before applying the sgn function) among the two-class classifiers.

4 Summary and Discussion

The Support Vector algorithm provides a principled way of choosing the number and the locations of RBF centers. Our experiments on a real-world pattern recognition problem have shown that compared to a corresponding number of centers chosen by k-means, the centers chosen by the Support Vector algorithm allowed a training error of zero, even if the weights were trained by classical RBF methods. Our interpretation of this finding is that the Support Vector centers are specifically chosen for the classification task at hand, whereas k-means does not care about picking those centers which will make a problem separable.

In addition, the SV centers yielded lower test error 
rates than k-means. It is interesting to note that using 
SV centers, while sticking to the classical procedure for 
training the weights, improved training and test error 
rates by approximately the same margin (2 per cent). 
In view of the guaranteed risk bound (2), this can be 
understood in the following way: the improvement in 
test error (risk) was solely due to the lower value of 
the training error (empirical risk); the confidence term 
(the second term on the right hand side of (2)), depend- 
ing on the VC-dimension and thus on the norm of the 
weight vector (Eq. 6), did not change, as we stuck to the 
classical weight training procedure. However, when we 
also trained the weights with the Support Vector algo- 
rithm, we minimized the norm of the weight vector (see 
Eq. 10) and thus the confidence term, while still keeping 
the training error zero. Thus, consistent with (2), the 
Support Vector machine achieved the highest test accu- 
racy of the three systems. 